Analyzing Pitchfork Album Reviews: A Sentiment
and Network Analysis of the Rating-Review Discrepancy

by Josh Iden


Introduction ¶


Pitchfork is a highly influential, industry-leading online music publication producing original album reviews in addition to features, news articles, and more. Pitchfork's album review ratings system assigns a score from 0 to 10 based on internal polling. An album's review score can have a significant effect on an artist's career, affecting sales and marketing, live touring and festival opportunities, licensing placement and fees, digital streaming playlists and radio pickup, physical distribution orders, future partnership opportunities and terms, and more. Generally, a rating of 7.0 or above is considered good, while 8.0 or above is considered excellent. Conversely, a score below 6.5 is considered poor, while a score below 6.0 is considered bad.
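These bands are informal reader convention rather than an official Pitchfork scheme; as a rough sketch, they can be expressed as a simple lookup (the "middling" label for the unnamed 6.5-6.9 range is my own):

```python
def rating_band(score: float) -> str:
    """Map a 0-10 Pitchfork score to the informal quality bands described above."""
    if score >= 8.0:
        return "excellent"
    if score >= 7.0:
        return "good"
    if score >= 6.5:
        return "middling"  # the conventional bands leave this range unnamed
    if score >= 6.0:
        return "poor"
    return "bad"

print(rating_band(7.2))  # a typical new-release rating
```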

This project analyzes 6000 reviews collected between December 1, 2017 and April 27, 2023 through web scraping the Pitchfork website, and asks the following questions:

  • Is there a relationship between the quality of a review and its rating?
  • Can a network analysis of writers and record labels derive centrality measures that establish a relationship between a writer's or record label's influence within the network and the rating an album receives?

The inspiration for this project is this: as a long-time Pitchfork reader, I've frequently read reviews that didn't seem to match the rating the album received: albums with low ratings receiving what I perceived to be favorable reviews and, less frequently, vice versa. I've also wondered whether a particular writer or label has an effect on the rating.


Summary of Findings¶


The data was filtered to focus on new reviews, removing 336 reissue reviews from the dataset, as the ratings distribution of reissues skews toward higher ratings: the average rating amongst reissues between 2017 and 2023 is 8.6, whereas the average rating amongst new releases during the same period is 7.2.
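The reissue/new split can be sketched as a boolean filter in pandas. The data below is invented, and `is_reissue` is a hypothetical flag name; the real criterion lives in the cleaning code:

```python
import pandas as pd

# Toy stand-in for the scraped dataset; `is_reissue` is a hypothetical column
# name used only for illustration.
data = pd.DataFrame({
    "album": ["A", "B", "C", "D"],
    "rating": [7.1, 8.9, 6.8, 8.4],
    "is_reissue": [False, True, False, True],
})

# split into new releases and reissues
data_reissues = data[data["is_reissue"]]
data_new = data[~data["is_reissue"]]

print(data_new.rating.mean(), data_reissues.rating.mean())
```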

VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis scored 45.8% of reviews as positive or very positive, 42% as average, and 12.2% as negative or very negative (only 0.2% of reviews rated "very negative"). There is no observable relationship between sentiment score and Pitchfork rating: the average rating of all new reviews, 7.2, was consistent across all sentiment categories, and several albums that received a rating of 10 had reviews scored by the VADER algorithm as "average" or "negative". Attempts to improve the algorithm's accuracy proved fruitless.

Network Analysis of Pitchfork writers with respect to the genres they cover identified the writers who are potentially the most influential and connected, as measured by degree centrality. These are not the writers who have written the most reviews. Other centrality measures were computed but were highly collinear with degree centrality. Amongst writers with at least 10 reviews, none of the ten writers with the highest average ratings were amongst the ten with the highest degree centrality. No correlation between centrality measures and ratings was observed.

Network Analysis of record labels with respect to genre was less informative, as over 82% of record labels had albums reviewed in only a single genre. However, amongst labels with at least ten reviews, the ten labels with the highest average ratings covered a mean of 5 genres, indicating that labels which release records in more than one genre tend to receive better ratings. A network analysis of labels with at least ten reviews identified the most influential and connected labels within the network, but no observable relationship exists between centrality measures and average Pitchfork rating.

Scatterplots and correlation matrices of the data indicated no correlation between the dependent variable (rating) and any of the independent variables collected. Predictive modeling of the dataset was unsuccessful.


Contents ¶

  • Data Acquisition
    • Data Cleaning
  • Exploratory Data Analysis
  • Sentiment Analysis
    • VADER Analysis
    • Lexical Analysis
  • Network Analysis
    • Writers Analysis
    • Labels Analysis
  • Predictive Modeling
  • Appendix A: Web Scraping
  • Appendix B: Data Tidying
  • Appendix C: Predictive Modeling

Data Acquisition ¶


A dataset of 6000 reviews was acquired by scraping Pitchfork's website using original Python scripts. The following data fields were extracted from the Pitchfork website:

  • Artist, the artist name
  • Album, the album being reviewed
  • Genre, the genre of music
  • Label, the record label releasing the album
  • Release Year, the year the album is released
  • Author, the writer of the review
  • Published, publishing date of the review
  • Best New Music, whether the review received Best New Music
  • Best New Reissue, whether the review received Best New Reissue
  • Score, album's ratings score
  • Review, full text of the album review
  • URL, link to the review

Web scraping code can be found in Appendix A. One note: from time to time, Pitchfork produces multi-album reviews covering several releases by a single artist. These multi-album reviews typically cover older releases and are featured in a carousel on the Pitchfork website. As a result, they require modifications to the web scraper to retrieve the review text in the same format as single-album reviews, which proved challenging; since this analysis focuses on new reviews, the multi-album reviews were excluded from the data before saving to disk.

In [46]:
# previewing the full dataset
data.head()
Out[46]:
artist album genre label releaseYear writer published bestNewMusic bestNewReissue rating review url
0 The National First Two Pages of Frankenstein Rock 4AD 2023 Brad Shoup 2023-04-27 False False 6.6 ’s ninth album, First Two Pages of Frankenstei... https://pitchfork.com/reviews/albums/the-natio...
1 Bill Evans Treasures: Solo, Trio and Orchestra Recordings... Jazz Elemental Music 2023 Mark Richardson 2023-04-27 False False 8.1 If’is the most common entry point for a new ja... https://pitchfork.com/reviews/albums/bill-evan...
2 Remi Kabaka Son of Africa Rock BBE Music 2023 Dean Van Nguyen 2023-04-27 False False 7.4 Musician, a Ghanaian-born Nigerian, first land... https://pitchfork.com/reviews/albums/remi-kaba...
3 Portrayal of Guilt Devil Music Rock Run for Cover 2023 Marty Sartini Garner 2023-04-27 False False 7.4 There are so many ways music can make people u... https://pitchfork.com/reviews/albums/portrayal...
4 Jim Legxacy HNPM Rap / Experimental (!) 2023 Dylan Green 2023-04-26 False False 7.8 Rapper-producer opens “miley’s riddim,” the th... https://pitchfork.com/reviews/albums/jim-legxa...

Data Cleaning ¶


The following steps were taken to clean the data:

  • Special characters and spaces were removed from the review column
  • author column is renamed writer
  • score column is renamed rating
  • releaseYear column is converted to string and stripped of its trailing float decimal (e.g., 2023.0 becomes 2023)
  • Two subsets of data are created:
    • data_new: new albums
    • data_reissues: reissue albums
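A minimal sketch of what these cleaning steps might look like in pandas (the regex and toy values here are illustrative, not the notebook's):

```python
import pandas as pd

# Toy frame illustrating the cleaning steps above (values are invented)
df = pd.DataFrame({
    "author": ["A. Writer"],
    "score": [7.4],
    "releaseYear": [2023.0],
    "review": ["Great\xa0album!!  "],
})

# strip special characters and collapse extra whitespace in the review text
df["review"] = (df["review"]
                .str.replace(r"[^\w\s.,'!?-]", " ", regex=True)
                .str.replace(r"\s+", " ", regex=True)
                .str.strip())

# rename columns and normalize the release year
df = df.rename(columns={"author": "writer", "score": "rating"})
df["releaseYear"] = df["releaseYear"].astype(int).astype(str)

print(df.loc[0, "releaseYear"])  # 2023
```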

Exploratory Data Analysis ¶


Ratings¶


Let's look at the ratings distribution of reissues and new albums:

In [7]:
dr = data_reissues.rating.describe().round(1).reset_index().rename(columns={'rating':'reissues'})
dn = data_new.rating.describe().round(1).reset_index().rename(columns={'rating':'new albums'})
print(tabulate(pd.merge(dr,dn,on='index'), headers="keys", tablefmt="github", showindex=False, numalign="right"))
| index   |   reissues |   new albums |
|---------|------------|--------------|
| count   |        336 |         5301 |
| mean    |        8.6 |          7.2 |
| std     |        0.8 |          0.8 |
| min     |        5.6 |          1.6 |
| 25%     |        8.1 |          6.8 |
| 50%     |        8.7 |          7.3 |
| 75%     |        9.2 |          7.7 |
| max     |         10 |           10 |
In [8]:
fig, axes = plt.subplots(1, 2, figsize=(21, 6), layout='constrained')

# reissues
sns.boxplot(ax=axes[0], x=data_reissues.rating)
axes[0].title.set_text('reissues')

# new albums
sns.boxplot(ax=axes[1], x=data_new.rating)
axes[1].title.set_text('new albums')

for i in range(len(axes)):
    axes[i].set_xlim(0,10)

Both datasets have a similar standard deviation, although the mean rating, as noted in the findings, is substantially higher amongst reissues than new albums. The new-albums dataset now includes 5301 observations, and there are many outliers below 5.5 amongst the ratings of new albums. Let's take a look at the distribution of ratings over time to see if there are any noticeable trends.

In [9]:
fig, axes = plt.subplots(1, 3, figsize=(21, 5), layout='constrained')

# daily
sns.lineplot(ax=axes[0], data=daily_ratings, x='published', y='avg_rating')
axes[0].title.set_text('daily')

# weekly
sns.lineplot(ax=axes[1], data=weekly_ratings, x='published', y='avg_rating')
axes[1].title.set_text('weekly')

# monthly
sns.lineplot(ax=axes[2], data=monthly_ratings, x='published', y='avg_rating')
axes[2].title.set_text('monthly')

ylab = ['avg_rating','','']

for i in range(len(axes)):
    axes[i].set_xlabel('')
    axes[i].set_ylabel(ylab[i])
    axes[i].set_ylim(0,10)
    axes[i].axhline(y=np.mean(data.rating), color='r');

plt.show();

Here, ratings are grouped into three buckets: average rating by day, week, and month. The red line represents the mean rating. Ratings fluctuate most from day to day and even out over longer windows. Nevertheless, on a monthly basis, average ratings were very low at the end of 2017 and beginning of 2018, and swung from low to high in the fourth quarter of 2022. The daily ratings give another sense of how the lower scores may be influencing the overall ratings.

In [11]:
data_new.describe(include="all")
Out[11]:
artist album genre label releaseYear writer published bestNewMusic bestNewReissue rating review url
count 5301 5301 5301 5301 5301 5301 5301 5301 5301 5301.000000 5301 5301
unique 3687 5248 91 1964 7 343 1581 2 1 NaN 5301 5301
top Various Artists DJ-Kicks Rock self-released 2018 Philip Sherburne 2019-04-11 False False NaN ’s ninth album, First Two Pages of Frankenstei... https://pitchfork.com/reviews/albums/the-natio...
freq 13 8 1567 175 1072 244 6 5068 5301 NaN 1 1
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.182079 NaN NaN
std NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.824410 NaN NaN
min NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.600000 NaN NaN
25% NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.800000 NaN NaN
50% NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.300000 NaN NaN
75% NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.700000 NaN NaN
max NaN NaN NaN NaN NaN NaN NaN NaN NaN 10.000000 NaN NaN

Writers¶


Let's take a look at writers, the number of reviews per writer, and the distribution of average ratings amongst those writers.

In [12]:
writers_df.describe()
Out[12]:
total_reviews avg_rating genres pct_total
count 343.000000 343.000000 343.000000 343.000000
mean 15.454810 7.227959 3.886297 0.002924
std 26.458897 0.475075 3.529560 0.004987
min 1.000000 5.400000 1.000000 0.000200
25% 2.000000 6.995000 1.000000 0.000400
50% 5.000000 7.220000 3.000000 0.000900
75% 16.000000 7.500000 5.000000 0.003000
max 244.000000 9.300000 23.000000 0.046000

343 writers have written reviews over this time frame. The average writer has written 15 reviews, but the median is 5, indicating the distribution of reviews per writer is heavily right-skewed.

In [13]:
writers_df.sort_values(by='total_reviews', ascending=False).round(5).head(10)
Out[13]:
writer total_reviews avg_rating genres pct_total
0 Philip Sherburne 244 7.53 18 0.0460
1 Ian Cohen 146 6.97 10 0.0275
2 Stuart Berman 131 7.48 14 0.0247
3 Ben Cardew 123 7.10 23 0.0232
4 Alphonse Pierre 121 6.59 4 0.0228
5 Evan Rytlewski 114 6.75 14 0.0215
6 Sheldon Pearce 112 7.03 9 0.0211
7 Andy Beta 104 7.25 16 0.0196
8 Sam Sodomsky 101 7.60 7 0.0191
9 Grayson Haver Currin 99 7.40 19 0.0187
In [14]:
sns.scatterplot(data=writers_df, x='total_reviews', y='avg_rating')
plt.axhline(y=np.mean(data_new.rating), color='r');

We can see that as a writer writes more reviews, their average rating converges toward the overall mean. This is largely a small-sample effect: a single low rating pulls down the average of a writer with few reviews far more than that of a writer with many reviews under their belt.
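This small-sample effect can be illustrated with a quick simulation, drawing toy ratings to match the dataset's summary statistics (mean 7.2, standard deviation 0.8):

```python
import random
import statistics

random.seed(42)

def writer_averages(n_reviews, n_writers=1000):
    """Average rating per simulated writer, each with n_reviews reviews."""
    return [statistics.mean(random.gauss(7.2, 0.8) for _ in range(n_reviews))
            for _ in range(n_writers)]

# averages from few reviews scatter far more widely around the mean
spread_few = statistics.stdev(writer_averages(2))
spread_many = statistics.stdev(writer_averages(50))
print(round(spread_few, 2), round(spread_many, 2))
```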

In [8]:
# writers with one review vs all other writers 
fig, axes = plt.subplots(1, 2, figsize=(21, 6), layout='constrained')

# writers with one review
sns.histplot(ax=axes[0], data=writers_df[writers_df['total_reviews'] == 1], x='avg_rating')
axes[0].axvline(data_new.rating.mean(), c='r', lw=1.5)
axes[0].title.set_text('one review')

# all other writers                    
sns.histplot(ax=axes[1], data=writers_df[writers_df['total_reviews'] != 1], x='avg_rating')
axes[1].axvline(data_new.rating.mean(), c='r', lw=1.5)
axes[1].title.set_text('all others')
                     
print('Total writers with one review: {}'.format(writers_df[writers_df['total_reviews'] == 1].writer.nunique()))
Total writers with one review: 81

We can see that the distribution of ratings for writers with one review tends to skew towards lower ratings.


Labels¶


Let's take a look at the labels.

In [16]:
print("Total Labels: {}".format(labels_df.label.nunique()))
print("Total Genres: {}".format(data_new.genre.nunique()))

x = labels_genre.label.value_counts().reset_index().label
sns.countplot(x=x)
plt.xlabel('number of genres')
plt.ylabel('number of labels')
plt.title('number of genres per label')
plt.plot();
Total Labels: 1964
Total Genres: 91
In [106]:
one_genre = labels_df[labels_df['total_genres'] == 1].label.count()
total_pct = round(one_genre / labels_df.label.nunique() * 100, 1)

print('Total number of labels with reviews in one genre: {}'.format(one_genre))
print('Percentage of labels with reviews in one genre: {}%'.format(total_pct))
Total number of labels with reviews in one genre: 1613
Percentage of labels with reviews in one genre: 82.1%
In [18]:
sns.scatterplot(data=labels_df, x='total_genres', y='avg_rating')
plt.axhline(y=np.mean(data_new.rating), color='r');

It appears that the more genres a label releases records in, the closer its average rating falls to the overall mean.


Sentiment Analysis ¶


In this section, I take two approaches to gain better insight into the contents of the reviews and how they may relate to Pitchfork ratings: the VADER algorithm and a lexical analysis.


VADER Sentiment Analysis ¶


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool included in the Natural Language Toolkit (NLTK) library in Python. VADER uses a lexicon (a list of words and their associated sentiment scores) to analyze the sentiment of a given text. The lexicon contains over 7,500 words and their associated positive or negative sentiment scores, as well as intensity modifiers that adjust the strength of the sentiment.

VADER outputs sentiment scores for the text on four dimensions: positive, negative, neutral, and compound. The positive, negative, and neutral scores represent the proportion of words in the text that are classified as positive, negative, or neutral. The compound score is a single number between -1 and 1 that represents the overall sentiment of the text, with -1 indicating extremely negative sentiment, 0 indicating neutral sentiment, and 1 indicating extremely positive sentiment.
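For reference, the compound score is not a simple proportion: VADER sums the valence scores of the lexicon words it recognizes and squashes that sum into (-1, 1). A sketch of the normalization used in the reference implementation (alpha = 15), which is why compound scores saturate toward ±1 rather than growing with text length:

```python
import math

def normalize(x, alpha=15):
    """VADER-style normalization: squash a raw valence sum into (-1, 1)."""
    return x / math.sqrt(x * x + alpha)

# larger raw sums approach but never reach 1
for raw in (0, 2, 10, 50):
    print(raw, round(normalize(raw), 3))
```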

The downside to the VADER algorithm is that it is designed specifically for analyzing sentiment in short social media texts. However, we can apply it to the reviews by splitting the text of each review into individual sentences and averaging the per-sentence scores.

We start by defining a function to convert each score to a sentiment category.

In [8]:
# function to rank each review
def rank(score):
    '''maps a compound sentiment score to a sentiment category'''
    if score <= -0.32:
        return "very negative"
    if score <= 0.0:
        return "negative"
    if score <= 0.16:
        return "average"
    if score <= 0.32:
        return "positive"
    return "very positive"

Next, we instantiate the SentimentIntensityAnalyzer object, and define a function to read through the reviews and assign scores and sentiments using the rank function we defined above.

In [9]:
sia = SentimentIntensityAnalyzer()

def sentence_score(reviews):
    '''compiles an aggregate score and sentiment category based on the sentences of each review'''
    sentence_scores = []
    sentence_sentiment = []

    for review in reviews:
        # score each sentence and average the compound scores
        scores = [sia.polarity_scores(sentence)["compound"] for sentence in nltk.sent_tokenize(review)]
        scores = round(np.mean(scores), 3)
        sentence_scores.append(scores)
        sentence_sentiment.append(rank(scores))

    return sentence_scores, sentence_sentiment

data_new['score'], data_new['sentiment'] = sentence_score(data_new.review)

In [21]:
fig, axes = plt.subplots(1, 3, figsize=(21, 5), layout='constrained')

# rating
sns.scatterplot(ax=axes[0], data=data_new, x='score', y='rating')
axes[0].title.set_text('rating ~ score')

# score
sns.histplot(ax=axes[1], x=data_new.score)
axes[1].axvline(data_new.score.mean(), c='r', lw=0.8)
axes[1].title.set_text('score')

# sentiment
sns.countplot(ax=axes[2], x=data_new.sentiment, order=['very negative','negative','average','positive','very positive'])
axes[2].title.set_text('sentiment')

ylab = ['','','']

for i in range(len(axes)):
    axes[i].set_xlabel('')
    axes[i].set_ylabel(ylab[i]);

We see a few things here:

  • The scatter plot indicates there is no perceivable relationship between sentiment score and rating
  • As we can see in the second figure above, most of the sentiment scores are above 0.
  • Most reviews fall on the average-to-positive side of the sentiment scale.
In [22]:
p = sns.FacetGrid(data_new, col="sentiment", height=3, aspect=1)
p.map(sns.histplot, "rating");

We can see that the distributions of ratings amongst the different sentiment categories are all still centered around the overall mean of 7.2, and indeed the group means confirm this.

In [23]:
print(tabulate(data_new.groupby('sentiment')['rating'].mean().round(2).reset_index().sort_values(by='rating', ascending=False), headers=['sentiment','average rating'], tablefmt="github", showindex=False, numalign="right"))
| sentiment     |   average rating |
|---------------|------------------|
| positive      |             7.21 |
| very positive |              7.2 |
| negative      |             7.19 |
| very negative |             7.17 |
| average       |             7.15 |

Here I include the average rating rounded to two decimal places to show that there is only slight variation across categories, and that the lowest average rating belongs to the "average" sentiment.

We can calculate and store the average sentiment score by writer to observe which writers write more positively than others, and to check for any relationship that might exist between those scores and other measures such as rating.

In [85]:
title = 'Top 10 Highest Average Sentiment Scores by Writer \n'
print(title)
print(tabulate(data_new.groupby('writer')['score'].mean().round(4).reset_index().sort_values(by='score', ascending=False).head(10), headers='keys', tablefmt="github", showindex=False))
Top 10 Highest Average Sentiment Scores by Writer 

| writer             |   score |
|--------------------|---------|
| Nazuk Kochhar      |   0.461 |
| Frankie Caracciolo |   0.423 |
| Dash Lewis         |   0.419 |
| Lawrence Burney    |   0.391 |
| Brandon Stosuy     |   0.351 |
| Maria Barrios      |   0.347 |
| Shamira Ibrahim    |   0.327 |
| Jesse Weiss        |   0.316 |
| Tyra Nicole Triche |   0.304 |
| Chris L. Terry     |   0.304 |
In [10]:
# store average sentiment score by writer
writer_scores = data_new.groupby('writer')['score'].mean().round(4).reset_index().rename(columns={'score':'avg_score'})

Lexical Analysis ¶


As sentiment analysis using the VADER algorithm failed to establish a relationship between sentiment and Pitchfork ratings, I next looked at which words appear most frequently in different categories of reviews to determine whether a lexical approach might yield quantifiable metrics that could be connected with ratings. To do this, I took the following steps:

  • Create a corpus using the entire review text
  • Remove stopwords such as "if", "and", "but"
  • Lemmatize all words to their roots
  • Remove the 100 most frequently used words

The goal here was to identify if there were any clear distinctions of words that are used more frequently amongst albums that received higher or lower Pitchfork ratings.

First, I created a couple functions to deal with the review text.

In [11]:
# create a corpus of reviews
def create_corpus(dfcolumn):
    '''takes a dataframe and returns a single string of its contents'''
    corpus = ''
    for row in dfcolumn:
        corpus += row.lower()
        
    return corpus        

# define a function to look through the corpus and return a frequency distribution object
def review_words(column, dropwords=[]):
    '''takes a dataframe column and an optional list of dropwords and returns a frequency distribution object of its words'''
    corpus = create_corpus(column)
    # tokenize the words
    words = nltk.word_tokenize(corpus)
    
    # retain only alphanumeric words
    words = [w for w in words if w.isalpha()]

    # remove stopwords 
    stopwords = nltk.corpus.stopwords.words("english")
    words = [w for w in words if w.lower() not in stopwords]
    
    # Lemmatizing words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # remove dropwords
    words = [word for word in words if word not in dropwords]

    # create frequency distribution 
    fd = nltk.FreqDist(words)
    
    return fd

Now I look at the entire corpus and split the frequency distributions into groups of 25 for better visualization.

In [13]:
fd = review_words(data_new.review)
In [26]:
top25 = pd.DataFrame(fd.most_common(25), columns=['word','count'])
top50 = pd.DataFrame(fd.most_common()[25:50], columns=['word','count'])
top75 = pd.DataFrame(fd.most_common()[50:75], columns=['word','count'])
top100 = pd.DataFrame(fd.most_common()[75:100], columns=['word','count'])
In [27]:
fig, axes = plt.subplots(2, 2, figsize=(21, 12), layout='constrained')

# 1-25
sns.barplot(ax=axes[0,0], data=top25, y='word',x='count')
axes[0,0].title.set_text('Top Words: 1-25')

# 25-50
sns.barplot(ax=axes[0,1], data=top50, y='word',x='count')
axes[0,1].title.set_text('Top Words: 26-50')

# 50-75
sns.barplot(ax=axes[1,0], data=top75, y='word',x='count')
axes[1,0].title.set_text('Top Words: 51-75')

# 75-100
sns.barplot(ax=axes[1,1], data=top100, y='word',x='count')
axes[1,1].title.set_text('Top Words: 76-100')

ax_ct = [(0,0),(0,1),(1,0),(1,1)]
for i in ax_ct:
    axes[i].set_xlabel('')
    axes[i].set_ylabel('');

As we can see, there are a lot of generic words here that don't tell us much of anything related to music. Let's keep the musical words from this list and remove the rest.

In [28]:
print(list(dict(fd.most_common(100)).keys()))
['like', 'album', 'song', 'sound', 'music', 'one', 'feel', 'track', 'new', 'time', 'record', 'band', 'even', 'make', 'year', 'way', 'voice', 'first', 'guitar', 'work', 'moment', 'much', 'vocal', 'love', 'two', 'every', 'life', 'pop', 'take', 'also', 'still', 'come', 'back', 'world', 'beat', 'could', 'never', 'rap', 'might', 'get', 'though', 'drum', 'sings', 'artist', 'hear', 'line', 'lyric', 'best', 'something', 'last', 'around', 'melody', 'rock', 'little', 'end', 'made', 'often', 'thing', 'would', 'sense', 'long', 'go', 'many', 'yet', 'always', 'part', 'past', 'debut', 'title', 'le', 'without', 'production', 'know', 'another', 'producer', 'kind', 'word', 'single', 'day', 'find', 'piano', 'since', 'solo', 'set', 'place', 'point', 'release', 'project', 'u', 'play', 'early', 'rapper', 'together', 'feeling', 'second', 'almost', 'enough', 'good', 'people', 'say']
In [29]:
# create list of words to drop
words_to_keep = ['sound','feel','voice','moment','vocal','love','life','pop','take','beat','rap','hear','lyric',
                'best','melody','rock','sense','debut','production','producer','piano','solo','rapper','feeling','good']

dropwords = [word for word, count in fd.most_common(100) if word not in words_to_keep]
len(dropwords)
Out[29]:
75

Now we reproduce the frequency distribution without the 75 most frequent words we've discarded, and observe the distribution of words by ratings group.


Word Frequency By Rating¶


In [30]:
# reviews by rating
review_8s = review_words(data_new[data_new['rating'] >= 8.0].review, dropwords)
review_7s = review_words(data_new[(data_new['rating'] >= 7.0) & (data_new['rating'] < 8.0)].review, dropwords)
review_6s = review_words(data_new[(data_new['rating'] >= 6.0) & (data_new['rating'] < 7.0)].review, dropwords)
review_5s = review_words(data_new[data_new['rating'] < 6.0].review, dropwords)

# top words by rating
word_8s = pd.DataFrame(review_8s.most_common(25), columns=['word','count'])
word_7s = pd.DataFrame(review_7s.most_common(25), columns=['word','count'])
word_6s = pd.DataFrame(review_6s.most_common(25), columns=['word','count'])
word_5s = pd.DataFrame(review_5s.most_common(25), columns=['word','count'])
In [31]:
fig, axes = plt.subplots(2, 2, figsize=(21, 12), layout='constrained')

# 8s
sns.barplot(ax=axes[0,0], data=word_8s, y='word',x='count')
axes[0,0].title.set_text('8.0+')

# 7s
sns.barplot(ax=axes[0,1], data=word_7s, y='word',x='count')
axes[0,1].title.set_text('7.0-7.9')

# 6s
sns.barplot(ax=axes[1,0], data=word_6s, y='word',x='count')
axes[1,0].title.set_text('6.0-6.9')

# below 6
sns.barplot(ax=axes[1,1], data=word_5s, y='word',x='count')
axes[1,1].title.set_text('below 6')

ax_ct = [(0,0),(0,1),(1,0),(1,1)]
for i in ax_ct:
    axes[i].set_xlabel('')
    axes[i].set_ylabel('');

The only interesting observation I glean from this data is one word that is used frequently in reviews for every group below 8.0 but is not amongst the most frequent words for albums rated 8.0 or higher: production. Every other category appears to contain similar words, making it challenging to quantify a distinction between categories. Given how many words repeat, let's take a look at the most frequently occurring trigrams to see if there are any distinctions to glean.

In [32]:
def trigram_extractor(column):
    '''takes a dataframe column and returns a frequency distribution object of its words'''
    corpus = create_corpus(column)
    # tokenize the words
    words = nltk.word_tokenize(corpus)
    
    # retain only alphanumeric words
    words = [w for w in words if w.isalpha()]

    # remove stopwords 
    stopwords = nltk.corpus.stopwords.words("english")
    words = [w for w in words if w.lower() not in stopwords]
    
    # collect trigrams 
    finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
    
    return finder
In [33]:
t_grams = trigram_extractor(data_new.review)
In [34]:
# 20 most common trigrams
print(tabulate(t_grams.ngram_fd.most_common(20), headers=['trigram','count']))
trigram                            count
-------------------------------  -------
('catch', 'every', 'saturday')      1429
('every', 'saturday', 'albums')     1429
('saturday', 'albums', 'week')      1429
('albums', 'week', 'sign')          1429
('week', 'sign', 'hear')            1429
('sign', 'hear', 'newsletter')      1429
('buy', 'catch', 'every')            707
('rock', 'n', 'roll')                206
('new', 'york', 'city')              130
('drum', 'n', 'bass')                125
('songs', 'feel', 'like')             55
('often', 'feels', 'like')            49
('album', 'feels', 'like')            49
('might', 'sound', 'like')            49
('first', 'two', 'albums')            48
('nearly', 'every', 'song')           47
('feels', 'less', 'like')             46
('sounds', 'like', 'could')           44
('two', 'years', 'later')             42
('album', 'title', 'track')           39

If we remove the top seven trigrams (the variations of the "catch every saturday" newsletter boilerplate), there actually aren't many frequently occurring trigrams, considering we're looking at roughly 5,300 reviews.
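A simple way to exclude that boilerplate is to drop any trigram containing a boilerplate token before ranking; a sketch with toy counts mirroring the table above:

```python
from collections import Counter

# tokens that mark the newsletter boilerplate rather than review prose
boilerplate = {"saturday", "newsletter", "sign"}

# toy trigram counts echoing the output above
trigram_counts = Counter({
    ("catch", "every", "saturday"): 1429,
    ("sign", "hear", "newsletter"): 1429,
    ("rock", "n", "roll"): 206,
    ("new", "york", "city"): 130,
})

# keep only trigrams with no boilerplate token
filtered = Counter({t: c for t, c in trigram_counts.items()
                    if not boilerplate & set(t)})
print(filtered.most_common(2))
```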


Network Analysis ¶


In this section, we perform network analysis on two pairs of variables in the dataset to try to establish relationships: writer and genre, and label and genre.

The idea here is that the more genres a writer or record label engages with, the more connected that writer or label will be within the entire network of Pitchfork writers or labels.


Writers Analysis ¶


First I construct a dataframe with each observation consisting of a writer, a genre, and the number of times the writer covers the genre:

In [12]:
writer_genre = pd.DataFrame(pd.Series(data_new.groupby(['writer', 'genre']).size(), name="count")).reset_index()
writer_genre.head()
Out[12]:
writer genre count
0 Abby Jones Folk/Country 1
1 Abby Jones Pop/R&B / Electronic 1
2 Abby Jones Rock 34
3 Abigail Covington Experimental 1
4 Abigail Covington Folk/Country 1

Next I construct a bipartite graph object and add writers and genres as nodes. The edges are the interactions between writers and genres, and the count is added as a weight to represent the number of times the writer and genre interact.

In [13]:
# list of unique writers
writers = list(writer_genre.writer.unique())
# list of unique genres
genres = list(writer_genre.genre.unique())

# instantiate Graph object
G = nx.Graph()

# add bipartite edges, 0: writer, 1: genre
G.add_nodes_from(writers, bipartite=0)
G.add_nodes_from(genres, bipartite=1)

# add edges
for index, row in writer_genre.iterrows():
    G.add_edge(row[0], row[1], weight=row[2])
In [16]:
# checking the graph is connected
nx.is_connected(G)
Out[16]:
True
In [17]:
# checking graph is bipartite
nx.is_bipartite(G)
Out[17]:
True

We can graph the entire network, but with 343 writers and 91 genres, it's quite challenging to draw a legible bipartite graph.

In [18]:
plt.figure(figsize=[21,16])

#colors = range(200)
options = {
    "node_color": "#A0CBE2",
    "edge_color": "lightgrey",
    "width": 0.3,
    "edge_cmap": plt.cm.Blues,
    "with_labels": True,
}

top = nx.bipartite.sets(G)[0]
pos = nx.bipartite_layout(G, top)

nx.draw(G, pos, **options);

Another option, which further illustrates the challenge of creating a two-dimensional bipartite graph, is a spring layout.

In [41]:
seed=456 # set seed for reproducibility
plt.figure(figsize=[20,12])

# Assign colors to the top and bottom nodes
top_nodes = {n for n, d in G.nodes(data=True) if d['bipartite']==0} # writers
bottom_nodes = set(G) - top_nodes # genres
node_colors = ['lightblue' if n in top_nodes else 'lightyellow' for n in G.nodes()]

# grab the edge weights to use for edge width
weights = [edata['weight']/2 for n,e,edata in G.edges(data=True)]

# grab the degrees to use for node size
degs = [v*500 for k, v in G.degree()]

# Set up the layout for the graph
pos = nx.spring_layout(G)

# Draw the graph with different colors for writers and genres
nx.draw(G, pos, width=weights, node_size=degs, node_color=node_colors, with_labels=True)

plt.show();

The pyvis library offers more aesthetically pleasing options out of the box, but it can be computationally expensive and does not render well from Jupyter notebooks on GitHub, so for this analysis we'll stick with NetworkX.

Besides, we're not really interested in visualizing the connections between writers and genres; we're interested in how writers are connected to each other. To visualize this, we project the bipartite graph onto the writer nodes and look only at how the writers are connected. We also define a function to calculate centrality measures from a graph object.
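The projection is easiest to see on a toy example (writer and genre names here are made up): two writers who share at least one genre become directly connected, and the projected edge weight counts the number of shared genres.

```python
import networkx as nx
from networkx.algorithms import bipartite as bi

# toy bipartite graph: two writers, two genres
toy = nx.Graph()
toy.add_nodes_from(["writer_a", "writer_b"], bipartite=0)
toy.add_nodes_from(["Rock", "Rap"], bipartite=1)
toy.add_edge("writer_a", "Rock", weight=3)
toy.add_edge("writer_b", "Rock", weight=2)
toy.add_edge("writer_a", "Rap", weight=1)

# project onto the writer nodes: writers sharing a genre become
# directly connected; the weight is the number of shared genres
proj = bi.weighted_projected_graph(toy, ["writer_a", "writer_b"])
print(proj["writer_a"]["writer_b"]["weight"])  # 1 -- one shared genre (Rock)
```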

In [14]:
# function to calculate degree centrality measures 
def graph_summary(graph):
    '''
    takes a networkx Graph object and returns a dataframe containing centrality measures
    '''
    degree = pd.Series(dict(nx.degree(graph)),name='degree')
    weight = pd.Series(dict(nx.degree(graph, weight='weight')),name='weight')
    closeness = pd.Series(dict(nx.closeness_centrality(graph)),name='closeness_centrality')
    degree_centrality = pd.Series(dict(nx.degree_centrality(graph)),name='degree_centrality')
    betweenness = pd.Series(dict(nx.betweenness_centrality(graph)),name='betweenness_centrality')
    eigen_centrality = pd.Series(dict(nx.eigenvector_centrality(graph)),name='eigenvector_centrality')
    
    return pd.concat([degree, weight, closeness, degree_centrality, betweenness, eigen_centrality], axis=1).sort_values(by='degree', ascending=False).reset_index().rename(columns={"index":"node"})
In [15]:
w = bi.weighted_projected_graph(G, writers)
writers = graph_summary(w).rename(columns={'node':'writer'}) # note: replaces the writers node list with a centrality dataframe
writers.sort_values(by='degree', ascending=False).head(10)
Out[15]:
writer degree weight closeness_centrality degree_centrality betweenness_centrality eigenvector_centrality
0 Matthew Ismael Ruiz 335 936 0.979943 0.979532 0.005966 0.072258
2 Ben Cardew 334 1048 0.977143 0.976608 0.004873 0.072259
1 Madison Bloom 334 962 0.977143 0.976608 0.004904 0.072255
3 Sam Goldner 333 967 0.974359 0.973684 0.004592 0.072231
5 Vrinda Jagota 331 864 0.968839 0.967836 0.004062 0.072171
4 Larry Fitzmaurice 331 922 0.968839 0.967836 0.005292 0.072047
6 Sasha Geffen 330 888 0.966102 0.964912 0.006053 0.072012
7 Evan Rytlewski 330 900 0.966102 0.964912 0.005027 0.072019
8 Jayson Greene 329 923 0.963380 0.961988 0.004085 0.072004
9 Nadine Smith 329 847 0.963380 0.961988 0.004363 0.071986

Some explanation regarding the centrality measures above,

Degree refers to the number of connections a node -- in this case, a writer -- has. It measures the writer's immediate reach within the network; in other words, the total number of other writers who have written about at least one of the same genres.

Weight refers to the strength of the connection between two nodes. It measures the amount of interaction between them, i.e., the number of times the two writers have written about the same genres.

Closeness Centrality is a measure of how close a node is to all other nodes in the network. It is the reciprocal of the average shortest-path distance between the node and all other nodes.

Degree Centrality is a measure of the importance of a node based on the number of edges it has. It is the number of edges connected to a node divided by the maximum possible degree, n - 1, where n is the number of nodes in the network.

Betweenness Centrality is a measure of how often a node appears on the shortest path between any two other nodes in the network. It captures the idea of gatekeeping or brokerage, where nodes with high betweenness centrality can act as bridges between different parts of the network.

Eigenvector Centrality is a measure of the importance of a node based on the connections it has with other important nodes in the network. It considers both the number and importance of the nodes that a node is connected to. The importance of a node increases if it is connected to other nodes that are themselves important.
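As a concrete illustration of how these measures differ, consider a hypothetical four-node path graph, where the two middle nodes act as bridges:

```python
import networkx as nx

# path graph A - B - C - D
g = nx.path_graph(["A", "B", "C", "D"])

print(dict(g.degree()))              # B and C have degree 2, the endpoints 1
print(nx.degree_centrality(g))       # degree divided by (n - 1)
print(nx.closeness_centrality(g))    # reciprocal of average shortest-path distance
print(nx.betweenness_centrality(g))  # B and C lie on the A-to-D shortest path
print(nx.eigenvector_centrality(g))  # B and C score highest: linked to each other
```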

We can visualize the graph,

In [44]:
seed=123 # set seed for reproducibility
plt.figure(figsize=[20,12])

# grab the edge weights to use for edge width
weights = [edata['weight']/4 for n,e,edata in w.edges(data=True)]

# grab the degrees to use for node size
degs = [v for k, v in w.degree()]

# Set up the layout for the graph
pos = nx.spring_layout(w, seed=seed)

# Draw the graph with different colors for writers and genres
nx.draw(w, pos, width=weights, node_size=degs,  with_labels=True)

plt.show();

In this graph, the size of a node represents its degree -- the total number of connections. Some nodes are almost indistinguishably small, while others are very large. We can use the island method to trim the graph down to its most strongly connected cluster so we can identify the most influential nodes.


Island Method¶


We define functions to iterate through the graph and return a table describing the modularity of the network -- how many connected components remain as weaker edges are removed. Then we can select the desired threshold with which to reduce the network and create a new graph object including only the edges above the selected threshold.

In [23]:
def trim_edges(g, weight=1):
    """
    Takes a graph, and applies a threshold (“water level”), 
    letting all edges above a certain value through, and removing all others. 
    
    It returns a copy of the original graph, so it’s non-destructive
    """
    g2 = nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, **edata)
    
    return g2

def island_method(g, iterations=5):
    weights = [edata['weight'] for f, to, edata in g.edges(data=True)]
    mn = int(min(weights))
    mx = int(max(weights))
    # compute the size of the step, so we get a reasonable step in iterations
    step = max(1, int((mx-mn)/iterations)) # guard against a step of zero
    
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn,mx,step)]
In [46]:
islands = island_method(w)
print('threshold','\t\t','size','\t\t','conn. components')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print (i[0],'\t\t\t', len(i[1]), '\t\t\t',len(list(nx.connected_components(i[1]))))
threshold 		 size 		 conn. components
1 			 234 			 1
3 			 132 			 1
5 			 73 			 1
7 			 26 			 1
9 			 6 			 1
11 			 3 			 1

We set the threshold to 7 and visualize the top 26 writers,

In [63]:
w_trim = trim_edges(w, weight=7)

plt.figure(figsize=[20,12])

# grab the edge weights to use for edge width
weights = [edata['weight']/8 for n,e,edata in w_trim.edges(data=True)]

# grab the degrees to use for node size
degs = [v*500 for k, v in w_trim.degree()]

# Set up the layout for the graph
pos = nx.spring_layout(w_trim)

# Draw the graph
nx.draw(w_trim, pos, width=weights, node_size=degs, node_color='lightblue', with_labels=True)

plt.show();

Now we can clearly visualize the most connected writers. Again, this sort of visualization is far more effective with three-dimensional, interactive functionality.

Next, we combine our centrality measures with our sentiment analysis measures to create a master dataset that we can use to observe any relationships that may exist amongst variables.

In [16]:
writers_master = pd.merge(pd.merge(writer_scores, writers_df, on='writer'), writers, on='writer')
writers_master.head()
Out[16]:
writer avg_score total_reviews avg_rating genres pct_total degree weight closeness_centrality degree_centrality betweenness_centrality eigenvector_centrality
0 Abby Jones 0.1463 36 6.80 3 0.0068 208 289 0.716981 0.608187 0.000189 0.052622
1 Abigail Covington 0.2058 10 7.21 3 0.0019 236 401 0.763393 0.690058 0.000587 0.057819
2 Adlan Jackson 0.1093 14 7.10 5 0.0026 308 592 0.909574 0.900585 0.002635 0.069531
3 Aimee Cliff 0.1400 23 7.15 6 0.0043 327 808 0.957983 0.956140 0.003548 0.071947
4 Alex Ramos 0.1470 1 6.50 1 0.0002 163 163 0.653920 0.476608 0.000000 0.040817

Labels Analysis ¶


Now let's take a look at the labels. First we construct a dataframe with each observation containing three columns: a label, a genre, and the number of times the genre occurs. Then we'll instantiate a Graph object and build the bipartite graph with labels and genres as nodes and the number of times they occur as the edge weights between them.

In [100]:
labels_genre.head()
Out[100]:
label genre count
0 1504190 Records Rap 1
1 August Greene, LLC Pop/R&B 1
2 Cohn Corporation Rap 1
3 Daydream Library Series Jazz / Pop/R&B / Rock 1
4 Daydream Library Series Rock 3
In [17]:
labels = list(labels_genre.label.unique()) # list of labels
genres = list(labels_genre.genre.unique()) # list of genres

G = nx.Graph()

G.add_nodes_from(labels, bipartite=0) # add nodes assigning a bipartite level to the two types of nodes
G.add_nodes_from(genres, bipartite=1)

for index, row in labels_genre.iterrows(): # iterate through the dataframe adding the edges and weights
    G.add_edge(row.iloc[0], row.iloc[1], weight=row.iloc[2])
In [95]:
# is the graph connected? 
nx.is_connected(G)
Out[95]:
False
In [97]:
nx.is_bipartite(G)
Out[97]:
True

Unlike the writers graph, the labels graph is not connected, meaning some labels are not connected to the rest of the network in any way. Part of the reason for this, as we know from earlier, is that 82% of all labels have reviews in only one genre. In order to identify the most connected labels, we'll need to find the largest connected component of the network.

In [18]:
# isolate largest connected component
largest_cc = max(nx.connected_components(G), key=len)
# project the largest connected component subgraph onto a new Graph object: S
S = G.subgraph(largest_cc)
# check the subgraph is connected
nx.is_connected(S)
Out[18]:
True
In [108]:
plt.figure(figsize=[21,16])

options = {
    "node_color": "#A0CBE2",
    "edge_color": "lightgrey",
    "width": 0.3,
    "edge_cmap": plt.cm.Blues,
    "with_labels": True,
}

top = nx.bipartite.sets(S)[0]
pos = nx.bipartite_layout(S, top)

nx.draw(S, pos, **options);

The full graph is still impossible to make sense of two-dimensionally, although we can see what look like the four most frequent genres (Rap, Pop/R&B, Electronic / Pop/R&B, and Rock),

In [123]:
print('Top 5 most frequent genres: \n')
print(tabulate(labels_genre.groupby('genre')['count'].sum().reset_index().sort_values(by='count', ascending=False).head(5),headers="keys", tablefmt="github", showindex=False, numalign="right"))
Top 5 most frequent genres: 

| genre        |   count |
|--------------|---------|
| Rock         |    1740 |
| Rap          |     950 |
| Pop/R&B      |     698 |
| Electronic   |     692 |
| Experimental |     500 |

But there are some problems with this: it's not really clear from the graph that Rock is by far the most frequent genre, and there is some ambiguity around the Electronic and Experimental nodes that blurs the distinction between the two. So again, we'll project the bipartite graph to look just at the labels that have the most interactions, using the island method:

In [24]:
# list of labels in subgraph
labels_b = [x for x,y in S.nodes(data=True) if y['bipartite']==0]

# project the bipartite graph
l = bi.weighted_projected_graph(S, labels_b)

# island method
islands = island_method(l)
print('threshold','\t\t','size','\t\t','conn. components')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print (i[0],'\t\t\t', len(i[1]), '\t\t\t',len(list(nx.connected_components(i[1]))))
threshold 		 size 		 conn. components
1 			 349 			 1
2 			 153 			 1
3 			 86 			 1
4 			 47 			 1
5 			 20 			 1
6 			 13 			 1
7 			 5 			 1
8 			 2 			 1
9 			 2 			 1

We'll look at the 20 most connected nodes by setting the threshold to 5 and plot the graph.

In [125]:
# trim the graph
l_trim = trim_edges(l, 5)

# plot the graph
plt.figure(figsize=[20,12])

# grab the edge weights to use for edge width
weights = [edata['weight']/8 for n,e,edata in l_trim.edges(data=True)]

# grab the degrees to use for node size
degs = [v*500 for k, v in l_trim.degree()]

# Set up the layout for the graph
pos = nx.spring_layout(l_trim)

# Draw the graph
nx.draw(l_trim, pos, width=weights, node_size=degs, node_color='lightyellow', with_labels=True)

plt.show();

We see self-released albums at the center of the network, which makes sense as a reference point for interaction by genre: self-released albums are by nature not bound to a single label, so they span many genres. The other labels in the subgroup are widely considered to be highly influential.

In order to produce centrality measures for the network and observe them against the rest of the dataset, we need to remove the unconnected observations from the dataset before modeling.

In [149]:
not_connected = len(labels_df.loc[~labels_df['label'].isin(labels_b)])
print('{} labels not connected. They are: \n'.format(not_connected))
print(tabulate(labels_df[['label','total_reviews']].loc[~labels_df['label'].isin(labels_b)], headers="keys", showindex=False, tablefmt="github"))
11 labels not connected. They are: 

| label                              |   total_reviews |
|------------------------------------|-----------------|
| Monster Boy                        |               1 |
| Mavin / Jonzing World / Virgin     |               1 |
| Season of Mist                     |               1 |
| New Amsterdam / figureight         |               1 |
| Impatience                         |               1 |
| Trojan Jamaica / BMG               |               1 |
| Conmatizque                        |               1 |
| Sector 7-G                         |               1 |
| Youth Sounds / Cadiz Entertainment |               1 |
| Metallic Music/1789                |               1 |
| New Deal / Impulse!                |               1 |
In [20]:
label_master = graph_summary(l).rename(columns={'node':'label'})
label_master.sort_values(by='degree', ascending=False).head(10)
Out[20]:
label degree weight closeness_centrality degree_centrality betweenness_centrality eigenvector_centrality
0 self-released 2013 2619 0.962759 0.961318 0.037651 0.055082
1 Columbia 1926 2400 0.925729 0.919771 0.019705 0.054968
2 XL 1861 2266 0.899871 0.888730 0.018314 0.054410
3 Warp 1798 2160 0.876151 0.858644 0.014295 0.054193
4 Ghostly International 1772 2099 0.866722 0.846227 0.012349 0.054128
5 Loma Vista 1749 2089 0.858197 0.835244 0.015004 0.053283
6 Sub Pop 1715 2070 0.846403 0.819007 0.014870 0.053063
7 Island 1696 2014 0.839952 0.809933 0.014477 0.053003
8 BMG 1651 1925 0.825059 0.788443 0.019116 0.052469
9 Leaving 1635 1893 0.820212 0.780802 0.015001 0.050666
In [21]:
# subset only connected labels
labels_sub = labels_df.loc[labels_df['label'].isin(labels_b)]
labels_sub.head()
Out[21]:
label total_reviews total_genres avg_rating
0 self-released 175 24 7.26
1 Domino 70 8 7.39
2 Sub Pop 58 9 7.17
3 Merge 58 9 7.43
4 Columbia 58 9 6.77
In [22]:
# average score by label
label_score = data_new[data_new['label'].isin(labels_b)].groupby('label')['score'].mean().reset_index().rename(columns={'score':'avg_score'})
# combine dataframes
label_m1 = pd.merge(pd.merge(labels_sub[['label','avg_rating']],label_score, on='label'), label_master, on='label')
label_m1.sort_values(by='degree', ascending=False).head(10)
Out[22]:
label avg_rating avg_score degree weight closeness_centrality degree_centrality betweenness_centrality eigenvector_centrality
0 self-released 7.26 0.138531 2013 2619 0.962759 0.961318 0.037651 0.055082
4 Columbia 6.77 0.125328 1926 2400 0.925729 0.919771 0.019705 0.054968
23 XL 7.62 0.156536 1861 2266 0.899871 0.888730 0.018314 0.054410
13 Warp 7.64 0.138167 1798 2160 0.876151 0.858644 0.014295 0.054193
75 Ghostly International 7.40 0.217250 1772 2099 0.866722 0.846227 0.012349 0.054128
20 Loma Vista 7.01 0.108067 1749 2089 0.858197 0.835244 0.015004 0.053283
2 Sub Pop 7.17 0.180828 1715 2070 0.846403 0.819007 0.014870 0.053063
38 Island 6.24 0.131450 1696 2014 0.839952 0.809933 0.014477 0.053003
34 BMG 6.63 0.151609 1651 1925 0.825059 0.788443 0.019116 0.052469
47 Leaving 7.24 0.203444 1635 1893 0.820212 0.780802 0.015001 0.050666

Now we can attempt to model our data and observe any relationship between review sentiments, network centrality, and ratings.


Predictive Modeling ¶


In this section, we'll take the writer and label centrality measures, along with the sentiment scores produced for each review, and determine whether any relationship exists among the variables strong enough to support a regression model for predicting ratings.

First, we'll map the centrality measures onto the first dataset. Then we'll visualize scatter plots and construct a correlation matrix to determine if there are any observable patterns or correlations amongst the variables.

In [25]:
# subset the data_new dataframe to the largest connected component of the label network
# retain only columns: artist, album, label, writer, rating, score
cols = ['artist','album','label','writer','rating','score']
model_df = data_new.loc[data_new['label'].isin(labels_b)][cols]
In [26]:
# create dictionaries of the centrality measure columns to map to model dataframe

# writers
w_close = dict(zip(writers.writer, writers.closeness_centrality))
w_degree = dict(zip(writers.writer, writers.degree_centrality))
w_bet = dict(zip(writers.writer, writers.betweenness_centrality))
w_eig = dict(zip(writers.writer, writers.eigenvector_centrality))

# labels
l_close = dict(zip(label_master.label, label_master.closeness_centrality))
l_degree = dict(zip(label_master.label, label_master.degree_centrality))
l_bet = dict(zip(label_master.label, label_master.betweenness_centrality))
l_eig = dict(zip(label_master.label, label_master.eigenvector_centrality))

# map dictionaries to new columns in model dataframe
# add centrality measure columns
# writers
model_df['w_close'] = model_df['writer'].map(w_close)
model_df['w_degree'] = model_df['writer'].map(w_degree)
model_df['w_between'] = model_df['writer'].map(w_bet)
model_df['w_eigen'] = model_df['writer'].map(w_eig)

# labels
model_df['l_close'] = model_df['label'].map(l_close)
model_df['l_degree'] = model_df['label'].map(l_degree)
model_df['l_between'] = model_df['label'].map(l_bet)
model_df['l_eigen'] = model_df['label'].map(l_eig)
In [27]:
model_df.head()
Out[27]:
artist album label writer rating score w_close w_degree w_between w_eigen l_close l_degree l_between l_eigen
0 The National First Two Pages of Frankenstein 4AD Brad Shoup 6.6 0.143 0.716981 0.608187 0.000189 0.052622 0.755139 0.677650 0.007824 0.046593
1 Bill Evans Treasures: Solo, Trio and Orchestra Recordings... Elemental Music Mark Richardson 8.1 0.377 0.797203 0.745614 0.001042 0.060945 0.505797 0.047755 0.000000 0.002252
2 Remi Kabaka Son of Africa BBE Music Dean Van Nguyen 7.4 0.246 0.936986 0.935673 0.003821 0.070861 0.689951 0.551576 0.004997 0.044517
3 Portrayal of Guilt Devil Music Run for Cover Marty Sartini Garner 7.4 -0.157 0.800937 0.754386 0.001288 0.061330 0.656015 0.478032 0.001923 0.040450
4 Jim Legxacy HNPM (!) Dylan Green 7.8 0.266 0.738661 0.649123 0.001646 0.050636 0.361908 0.001433 0.000000 0.000019

Now we can visualize some scatter plots with ratings as the y (dependent) variable,

In [49]:
def scatter_figs(df, title):
    '''takes model dataframe and returns scatter plots'''
    fig = make_subplots(rows=3, cols=3, shared_yaxes=True)

    fig.add_trace(go.Scatter(x=df.score, y=df.rating, mode='markers', name="score"),
              row=1, col=1)

    fig.add_trace(go.Scatter(x=df.w_close, y=df.rating, mode='markers', name="writer closeness"),
              row=1, col=2)

    fig.add_trace(go.Scatter(x=df.w_degree, y=df.rating, mode='markers', name="writer degree"),
              row=1, col=3)

    fig.add_trace(go.Scatter(x=df.w_between, y=df.rating, mode='markers', name="writer between"),
              row=2, col=1)

    fig.add_trace(go.Scatter(x=df.w_eigen, y=df.rating, mode='markers', name="writer eigen"),
              row=2, col=2)

    fig.add_trace(go.Scatter(x=df.l_close, y=df.rating, mode='markers', name="label closeness"),
              row=2, col=3)
    
    fig.add_trace(go.Scatter(x=df.l_degree, y=df.rating, mode='markers', name="label degree"),
              row=3, col=1)

    fig.add_trace(go.Scatter(x=df.l_between, y=df.rating, mode='markers', name="label between"),
              row=3, col=2)

    fig.add_trace(go.Scatter(x=df.l_eigen, y=df.rating, mode='markers', name="label eigen"),
              row=3, col=3)

    fig.update_layout(height=600, width=850,
                  title_text=title)
    
    return fig.show("svg")

scatter_figs(model_df, 'Rating as a Function of Predictors')
[Figure: "Rating as a Function of Predictors" -- a 3x3 grid of scatter plots of rating against score, writer closeness, writer degree, writer betweenness, writer eigenvector, label closeness, label degree, label betweenness, and label eigenvector.]

There does not appear to be a relationship between rating and the predictors we've identified -- we noted earlier the inconsistency in the sentiment scores. For a few of the variables (specifically the betweenness and eigenvector measures), there appear to be distinct levels or categories, suggesting an interaction term may be appropriate in a regression model, but the absence of any clear correlation is concerning. We can produce a correlation matrix to determine whether there is in fact any correlation amongst the predictors and the response.

In [36]:
plt.figure(figsize=[12,8])
corr_matrix = model_df.iloc[:,4:].corr()
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.show();

We observe high collinearity amongst the centrality measures, and nearly zero correlation between each predictor and the response, ruling out a useful predictive model based on the VADER sentiment scores and centrality measures derived from genre diversity. For a run-through of what a predictive model built on the compiled data returns, please see Appendix C.


Appendix A: Web Scraping ¶


Due to the number of pages that required scraping, the task was split into two parts:

  • Part 1: extraction of the 6000 individual review links. GitHub
  • Part 2: scrape and parse each page and build a dataframe from the output. GitHub
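The scraping loop itself follows a standard requests + BeautifulSoup pattern; below is a minimal sketch of the Part 2 idea under stated assumptions -- the selectors and field names are illustrative placeholders, not Pitchfork's actual markup, which is handled in the linked notebooks.

```python
import time
import requests
from bs4 import BeautifulSoup

def scrape_review(url):
    """Fetch one review page and parse a few illustrative fields.
    The tags used here are placeholders -- real pages need inspection."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    h1 = soup.find("h1")
    return {
        "title": h1.get_text(strip=True) if h1 else None,
        "review": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }

# polite scraping: pause between requests so as not to hammer the server
# rows = []
# for url in review_links:
#     rows.append(scrape_review(url))
#     time.sleep(1)
```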

Appendix B: Data Tidying ¶


In [2]:
## ALL PACKAGES USED FOR THIS REPORT -- RUN THIS CELL FIRST TO REPRODUCE CODE ##
import pandas as pd 
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import ngrams, bigrams, trigrams
import networkx as nx
from networkx.algorithms import bipartite as bi
from tabulate import tabulate
import re
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.stats.outliers_influence import variance_inflation_factor

pio.renderers.default = "svg"
%matplotlib inline
In [3]:
# loading the full dataset
data = pd.read_csv('data.csv').dropna()

# remove bracketed and parenthesized text
data['review'] = [re.sub(r'\(.*?\)|\[.*?\]', '', x) for x in data['review']]
# remove asterisks
data['review'] = [re.sub(r"\*", " ", x) for x in data['review']]
# remove escaped double quotes
data['review'] = [x.replace('\\"', ' ') for x in data['review']]
# remove backslashes
data['review'] = [re.sub(r"\\", ' ', x) for x in data['review']]
# normalize whitespace around periods and commas
data['review'] = [x.replace(".", '. ').replace(" .", ".").replace("  ", " ").replace("\"", '').replace(",,", ",").replace(" ,", ",") for x in data['review']]
# rename score column: rating 
data.rename(columns={'author':'writer','score':'rating'}, inplace=True)
# releaseYear column as string
data['releaseYear'] = [str(x).replace(".0", "") for x in data['releaseYear']]
In [4]:
## FILTERING OUT REISSUES AND NEW ALBUMS ##
# list of possible words denoting reissue
reissues = ['deluxe','reissue','anniversary','anniversary)','edition'] # all lowercase, to match the str.lower() comparisons below

# filter out reissue reviews - only new reviews
data_new = data[data['releaseYear'] >= '2017']
data_new = data_new[data_new['bestNewReissue'] != True]
data_new = data_new[~data_new['album'].str.lower().isin(reissues)]
data_new = data_new[data_new['artist'] != 'King Crimson']

# isolate the reissues
data_reissues = data[(data['releaseYear'] < '2017') | (data['album'].str.lower().isin(reissues)) | (data['artist'] == 'King Crimson')]
In [5]:
## EXPLORING RATINGS OVER TIME ##
# daily ratings
daily_ratings = data_new[['published','rating']].copy() # .copy() avoids a SettingWithCopyWarning
daily_ratings['published'] = pd.to_datetime(daily_ratings['published'])
daily_ratings = daily_ratings.groupby('published')['rating'].mean().round(1).reset_index().rename(columns={'rating':'avg_rating'})

# weekly ratings
weekly_ratings = daily_ratings.groupby([pd.Grouper(key='published', freq='W')])['avg_rating'].mean().round(1).reset_index()

# monthly ratings
monthly_ratings = daily_ratings.groupby([pd.Grouper(key='published', freq='M')])['avg_rating'].mean().round(1).reset_index();
In [6]:
## WRITERS ## 
# total number of reviews
writers_reviews = data_new.writer.value_counts().reset_index().rename(columns={'index':'writer','writer':'total_reviews'})
# avg rating
writers_ratings = data_new.groupby('writer')['rating'].mean().round(2).reset_index().rename(columns={'index':'writer','rating':'avg_rating'})
# total number of genres 
writers_genres = data_new.groupby('writer')['genre'].nunique().reset_index().rename(columns={'genre':'genres'})
# combine datasets
writers_df = pd.merge(pd.merge(writers_reviews, writers_ratings, on='writer'), writers_genres, on='writer')
# percentage of total reviews
writers_df['pct_total'] = writers_df['total_reviews'].agg(lambda x: x/len(data_new)).round(4)
In [7]:
## LABELS ## 
# genres by label
labels_genre = pd.DataFrame(pd.Series(data.groupby(['label', 'genre']).size(), name="count")).reset_index()
# number of review by label
label_reviews = pd.Series(data_new.label.value_counts(ascending=False), name='total_reviews').reset_index().rename(columns={'index':'label'})
# average rating per label 
label_ratings = data_new.groupby('label')['rating'].mean().round(2).reset_index().rename(columns={'rating':'avg_rating',})
# total genres by label
total_genres = labels_genre.label.value_counts().reset_index().rename(columns={'index':'label','label':'total_genres'})
# master label
labels_df = pd.merge(pd.merge(label_reviews, total_genres, on='label'), label_ratings, on='label')
# labels with more than ten reviews
top_reviews = labels_df[labels_df['total_reviews'] >= 10].sort_values(by='avg_rating',ascending=False)

Appendix C: Predictive Modeling ¶


In [38]:
# split the model into training and testing sets: 80/20 split
train, test = train_test_split(model_df.iloc[:,4:], test_size=0.2, random_state=42)

Iteration 1¶

In [43]:
X_train = train.drop('rating', axis=1) # predictors
y_train = train['rating'] # response variable

# add constant to predictors
X_train = sm.add_constant(X_train)

# create the regression model
model = sm.OLS(y_train, X_train).fit()

# print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     6.868
Date:                Tue, 16 May 2023   Prob (F-statistic):           7.14e-10
Time:                        12:59:02   Log-Likelihood:                -5141.7
No. Observations:                4232   AIC:                         1.030e+04
Df Residuals:                    4222   BIC:                         1.037e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.1131      0.765      7.990      0.000       4.613       7.613
score          0.1342      0.099      1.349      0.177      -0.061       0.329
w_close        4.8105      1.268      3.794      0.000       2.324       7.296
w_degree      -8.1150      1.416     -5.729      0.000     -10.892      -5.338
w_between     14.4671     17.700      0.817      0.414     -20.234      49.168
w_eigen       72.6310     11.785      6.163      0.000      49.526      95.736
l_close       -2.1367      1.269     -1.684      0.092      -4.625       0.351
l_degree       0.7408      0.533      1.390      0.165      -0.304       1.786
l_between     11.9358      5.489      2.174      0.030       1.174      22.698
l_eigen       -3.2812      2.088     -1.572      0.116      -7.374       0.812
==============================================================================
Omnibus:                     1222.847   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4344.028
Skew:                          -1.419   Prob(JB):                         0.00
Kurtosis:                       7.072   Cond. No.                     2.62e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

We can see there is an extremely low R-squared and multicollinearity amongst the predictors. We use the SequentialFeatureSelector to identify the predictors with the most significance to the model:

Iteration 2¶

In [44]:
X_train = train.drop('rating', axis=1)
y_train = train['rating']

# instantiate sequential feature selector
selector = SequentialFeatureSelector(estimator=LinearRegression(),
                                     n_features_to_select=3,
                                     direction='forward',
                                     scoring='r2')

# fit the selector to the data
selector.fit(X_train, y_train)

# view selected features
selected_features = X_train.columns[selector.get_support()]
print(selected_features)
Index(['w_between', 'w_eigen', 'l_degree'], dtype='object')
In [45]:
# fit the model using the selected features as predictors
X_train = train[selected_features] # predictors
y_train = train['rating'] # response variable

# add constant to predictors
X_train = sm.add_constant(X_train)

# create the regression model
model = sm.OLS(y_train, X_train).fit()

# print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     5.963
Date:                Tue, 16 May 2023   Prob (F-statistic):           0.000471
Time:                        13:01:58   Log-Likelihood:                -5163.5
No. Observations:                4232   AIC:                         1.033e+04
Df Residuals:                    4228   BIC:                         1.036e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.1388      0.095     75.018      0.000       6.952       7.325
w_between    -32.7477     10.663     -3.071      0.002     -53.652     -11.844
w_eigen        2.8215      1.721      1.640      0.101      -0.552       6.195
l_degree      -0.1456      0.052     -2.792      0.005      -0.248      -0.043
==============================================================================
Omnibus:                     1260.309   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4531.775
Skew:                          -1.461   Prob(JB):                         0.00
Kurtosis:                       7.143   Cond. No.                         920.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see the R-squared has actually gone down, and the writer eigenvector centrality predictor is not significant to the model.